{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Combining data sources\n", "\n", "Data can come from different files and you can combine them in a single dataframe.\n", "\n", "When working with data, storing data across multiple files or data sources is common. Combining data from different files or data sources into a single dataframe can be useful for data analysis and modelling. \n", "\n", "In Pandas, combining data from additional files or sources is done using the merge or join functions. These functions allow us to connect data based on a common key or index, which can be helpful when analyzing related data. \n", "\n", "This notebook will explore how to merge or join data from different files or sources into a single Pandas dataframe. We will also discuss different types of joins and merge operations and how to handle missing or duplicate data during the merging process." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How To" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"data/housing.csv\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
longitudelatitudehousing_median_agetotal_roomstotal_bedroomspopulationhouseholdsmedian_incomemedian_house_valueocean_proximity
0-122.2337.8841.0880.0129.0322.0126.08.3252452600.0NEAR BAY
1-122.2237.8621.07099.01106.02401.01138.08.3014358500.0NEAR BAY
2-122.2437.8552.01467.0190.0496.0177.07.2574352100.0NEAR BAY
3-122.2537.8552.01274.0235.0558.0219.05.6431341300.0NEAR BAY
4-122.2537.8552.01627.0280.0565.0259.03.8462342200.0NEAR BAY
\n", "
" ], "text/plain": [ " longitude latitude housing_median_age total_rooms total_bedrooms \\\n", "0 -122.23 37.88 41.0 880.0 129.0 \n", "1 -122.22 37.86 21.0 7099.0 1106.0 \n", "2 -122.24 37.85 52.0 1467.0 190.0 \n", "3 -122.25 37.85 52.0 1274.0 235.0 \n", "4 -122.25 37.85 52.0 1627.0 280.0 \n", "\n", " population households median_income median_house_value ocean_proximity \n", "0 322.0 126.0 8.3252 452600.0 NEAR BAY \n", "1 2401.0 1138.0 8.3014 358500.0 NEAR BAY \n", "2 496.0 177.0 7.2574 352100.0 NEAR BAY \n", "3 558.0 219.0 5.6431 341300.0 NEAR BAY \n", "4 565.0 259.0 3.8462 342200.0 NEAR BAY " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
latitudelongitudeocean_proximity
037.88-122.23NEAR BAY
137.86-122.22NEAR BAY
237.85-122.24NEAR BAY
337.85-122.25NEAR BAY
437.85-122.25NEAR BAY
\n", "
" ], "text/plain": [ " latitude longitude ocean_proximity\n", "0 37.88 -122.23 NEAR BAY\n", "1 37.86 -122.22 NEAR BAY\n", "2 37.85 -122.24 NEAR BAY\n", "3 37.85 -122.25 NEAR BAY\n", "4 37.85 -122.25 NEAR BAY" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_geo = df[[\"latitude\", \"longitude\", \"ocean_proximity\"]]\n", "df_geo.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
latitudelongitudeocean_proximitymedian_house_value
037.88-122.23NEAR BAY452600.0
137.86-122.22NEAR BAY358500.0
237.85-122.24NEAR BAY352100.0
337.85-122.25NEAR BAY341300.0
437.85-122.25NEAR BAY342200.0
...............
2063539.48-121.09INLAND78100.0
2063639.49-121.21INLAND77100.0
2063739.43-121.22INLAND92300.0
2063839.43-121.32INLAND84700.0
2063939.37-121.24INLAND89400.0
\n", "

20640 rows × 4 columns

\n", "
" ], "text/plain": [ " latitude longitude ocean_proximity median_house_value\n", "0 37.88 -122.23 NEAR BAY 452600.0\n", "1 37.86 -122.22 NEAR BAY 358500.0\n", "2 37.85 -122.24 NEAR BAY 352100.0\n", "3 37.85 -122.25 NEAR BAY 341300.0\n", "4 37.85 -122.25 NEAR BAY 342200.0\n", "... ... ... ... ...\n", "20635 39.48 -121.09 INLAND 78100.0\n", "20636 39.49 -121.21 INLAND 77100.0\n", "20637 39.43 -121.22 INLAND 92300.0\n", "20638 39.43 -121.32 INLAND 84700.0\n", "20639 39.37 -121.24 INLAND 89400.0\n", "\n", "[20640 rows x 4 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_geo.join(df[\"median_house_value\"])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "df_price = df[[\"longitude\", \"latitude\", \"median_house_value\"]]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
latitudelongitude_xocean_proximitylongitude_ymedian_house_value
037.88-122.23NEAR BAY-122.23452600.0
137.88-122.23NEAR BAY-122.34350000.0
237.88-122.23NEAR BAY-122.29216700.0
337.88-122.23NEAR BAY-122.28261300.0
437.88-122.23NEAR BAY-122.26391800.0
..................
177570735.96-119.04INLAND-119.0451700.0
177570835.86-119.46INLAND-119.4642700.0
177570935.85-119.12INLAND-119.1258300.0
177571035.89-119.27INLAND-119.2753300.0
177571135.88-119.27INLAND-119.2743700.0
\n", "

1775712 rows × 5 columns

\n", "
" ], "text/plain": [ " latitude longitude_x ocean_proximity longitude_y \\\n", "0 37.88 -122.23 NEAR BAY -122.23 \n", "1 37.88 -122.23 NEAR BAY -122.34 \n", "2 37.88 -122.23 NEAR BAY -122.29 \n", "3 37.88 -122.23 NEAR BAY -122.28 \n", "4 37.88 -122.23 NEAR BAY -122.26 \n", "... ... ... ... ... \n", "1775707 35.96 -119.04 INLAND -119.04 \n", "1775708 35.86 -119.46 INLAND -119.46 \n", "1775709 35.85 -119.12 INLAND -119.12 \n", "1775710 35.89 -119.27 INLAND -119.27 \n", "1775711 35.88 -119.27 INLAND -119.27 \n", "\n", " median_house_value \n", "0 452600.0 \n", "1 350000.0 \n", "2 216700.0 \n", "3 261300.0 \n", "4 391800.0 \n", "... ... \n", "1775707 51700.0 \n", "1775708 42700.0 \n", "1775709 58300.0 \n", "1775710 53300.0 \n", "1775711 43700.0 \n", "\n", "[1775712 rows x 5 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_geo.merge(df_price, left_on=\"latitude\", right_on=\"latitude\")" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
longitudelatitudemedian_house_valuelatitudelongitudeocean_proximity
0-122.2337.88452600.037.88-122.23NEAR BAY
1-122.2237.86358500.037.86-122.22NEAR BAY
2-122.2437.85352100.037.85-122.24NEAR BAY
3-122.2537.85341300.037.85-122.25NEAR BAY
4-122.2537.85342200.037.85-122.25NEAR BAY
.....................
20635-121.0939.4878100.039.48-121.09INLAND
20636-121.2139.4977100.039.49-121.21INLAND
20637-121.2239.4392300.039.43-121.22INLAND
20638-121.3239.4384700.039.43-121.32INLAND
20639-121.2439.3789400.039.37-121.24INLAND
\n", "

20640 rows × 6 columns

\n", "
" ], "text/plain": [ " longitude latitude median_house_value latitude longitude \\\n", "0 -122.23 37.88 452600.0 37.88 -122.23 \n", "1 -122.22 37.86 358500.0 37.86 -122.22 \n", "2 -122.24 37.85 352100.0 37.85 -122.24 \n", "3 -122.25 37.85 341300.0 37.85 -122.25 \n", "4 -122.25 37.85 342200.0 37.85 -122.25 \n", "... ... ... ... ... ... \n", "20635 -121.09 39.48 78100.0 39.48 -121.09 \n", "20636 -121.21 39.49 77100.0 39.49 -121.21 \n", "20637 -121.22 39.43 92300.0 39.43 -121.22 \n", "20638 -121.32 39.43 84700.0 39.43 -121.32 \n", "20639 -121.24 39.37 89400.0 39.37 -121.24 \n", "\n", " ocean_proximity \n", "0 NEAR BAY \n", "1 NEAR BAY \n", "2 NEAR BAY \n", "3 NEAR BAY \n", "4 NEAR BAY \n", "... ... \n", "20635 INLAND \n", "20636 INLAND \n", "20637 INLAND \n", "20638 INLAND \n", "20639 INLAND \n", "\n", "[20640 rows x 6 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df_price, df_geo], join=\"inner\", axis=1)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
latitudelongitudeocean_proximity
037.88-122.23NEAR BAY
137.86-122.22NEAR BAY
237.85-122.24NEAR BAY
337.85-122.25NEAR BAY
437.85-122.25NEAR BAY
............
1361934.13-117.25INLAND
1616937.79-122.55NEAR OCEAN
1972639.07-121.70INLAND
933737.97-122.59NEAR OCEAN
478834.02-118.32<1H OCEAN
\n", "

21640 rows × 3 columns

\n", "
" ], "text/plain": [ " latitude longitude ocean_proximity\n", "0 37.88 -122.23 NEAR BAY\n", "1 37.86 -122.22 NEAR BAY\n", "2 37.85 -122.24 NEAR BAY\n", "3 37.85 -122.25 NEAR BAY\n", "4 37.85 -122.25 NEAR BAY\n", "... ... ... ...\n", "13619 34.13 -117.25 INLAND\n", "16169 37.79 -122.55 NEAR OCEAN\n", "19726 39.07 -121.70 INLAND\n", "9337 37.97 -122.59 NEAR OCEAN\n", "4788 34.02 -118.32 <1H OCEAN\n", "\n", "[21640 rows x 3 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_geo.append(df_geo.sample(1000))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise\n", "Familiarize yourself with the merge operations and try how `inner`, `left`, `right`, and `outer` change the merge results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.concat([df_price, df_geo], join=\"inner\", axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Additional Resources" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [Merge, join, concatenate and compare](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }